Self Supervision Does Not Help Natural Language Supervision at Scale
Self-supervision and natural language supervision have emerged as two
exciting ways to train general-purpose image encoders that excel at a variety
of downstream tasks. Recent works such as M3AE and SLIP have suggested that
these approaches can be effectively combined, but, notably, their results
use small pre-training datasets (<50M samples) and do not reflect the
large-scale regime (>100M examples) that is commonly used for these
approaches. Here we investigate whether a similar approach can be effective
when trained with a much larger amount of data. We find that a combination of
two state-of-the-art approaches, masked auto-encoders (MAE) and contrastive
language-image pre-training (CLIP), provides a benefit over CLIP alone when
trained on a corpus of 11.3M image-text pairs, but little to no benefit (as
evaluated on a suite of common vision tasks) over CLIP when trained on a large
corpus of 1.4B images. Our work provides some much-needed clarity into the
effectiveness (or lack thereof) of self-supervision for large-scale image-text
training.
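
A minimal sketch of how such a combined objective might look: a CLIP-style
contrastive loss on paired image-text embeddings plus an MAE-style masked
pixel reconstruction loss. The encoder/decoder interfaces, the temperature,
and the `mae_weight` mixing coefficient are illustrative assumptions, not the
paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def clip_mae_loss(image_encoder, text_encoder, mae_decoder,
                  images, texts, mask_ratio=0.75, mae_weight=0.5):
    """Joint objective: CLIP contrastive loss + MAE reconstruction loss.

    `image_encoder`, `text_encoder`, `mae_decoder`, and `mae_weight` are
    hypothetical names; the abstract does not specify this interface.
    """
    # Contrastive (CLIP) branch: align image and text embeddings.
    img_emb = F.normalize(image_encoder(images), dim=-1)        # (B, D)
    txt_emb = F.normalize(text_encoder(texts), dim=-1)          # (B, D)
    logits = img_emb @ txt_emb.t() / 0.07                       # assumed temperature
    targets = torch.arange(images.shape[0], device=images.device)
    contrastive = (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.t(), targets)) / 2

    # Reconstruction (MAE) branch: predict pixels of masked patches.
    # Assumes the decoder returns (predictions, binary mask, target patches).
    pred, mask, patches = mae_decoder(images, mask_ratio=mask_ratio)
    recon = ((pred - patches) ** 2).mean(dim=-1)                # per-patch MSE
    recon = (recon * mask).sum() / mask.sum()                   # masked patches only

    return contrastive + mae_weight * recon
```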
On Robustness in Multimodal Learning
Multimodal learning is defined as learning over multiple heterogeneous input
modalities such as video, audio, and text. In this work, we are concerned with
understanding how models behave when the available modalities differ between
training and deployment, a situation that naturally arises in many applications
of multimodal learning to hardware platforms. We present a multimodal
robustness framework to provide a systematic analysis of common multimodal
representation learning methods. Further, we identify robustness shortcomings
of these approaches and propose two intervention techniques leading to
robustness improvements on three datasets: AudioSet, Kinetics-400, and
ImageNet-Captions. Finally, we demonstrate that these interventions better
utilize additional modalities, if present, to achieve competitive mAP results
on AudioSet 20K.
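
To make the train/deployment mismatch concrete, here is a sketch of one
generic intervention for this failure mode: randomly dropping modalities
during training so the fused representation tolerates missing inputs at test
time. Modality dropout is a common technique and an assumption here, not
necessarily one of the paper's two interventions; all names are hypothetical.

```python
import torch

def fuse_with_modality_dropout(features: dict, drop_prob: float = 0.3,
                               training: bool = True) -> torch.Tensor:
    """Average-fuse modality embeddings, randomly dropping modalities
    during training so the model learns to handle absent modalities.

    `features` maps modality name -> (B, D) embedding tensor.
    """
    names = list(features)
    kept = names
    if training:
        kept = [n for n in names if torch.rand(()) > drop_prob]
        if not kept:  # always keep at least one modality
            kept = [names[torch.randint(len(names), ()).item()]]
    stacked = torch.stack([features[n] for n in kept])  # (M, B, D)
    return stacked.mean(dim=0)                          # (B, D) fused embedding

# At deployment, simply pass whichever modalities are available:
# fused = fuse_with_modality_dropout({"audio": a_emb}, training=False)
```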
TiC-CLIP: Continual Training of CLIP Models
Keeping large foundation models up to date on the latest data is inherently
expensive. To avoid the prohibitive costs of constantly retraining, it is
imperative to continually train these models. This problem is exacerbated by
the lack of any large-scale continual learning benchmarks or baselines. We
introduce the first set of web-scale Time-Continual (TiC) benchmarks for
training vision-language models: TiC-DataComp, TiC-YFCC, and TiC-RedCaps, with
over 12.7B timestamped image-text pairs spanning 9 years (2014--2022). We first
use our benchmarks to curate various dynamic evaluations to measure the
temporal robustness of existing models. We show that OpenAI's CLIP (trained on
data up to 2020) loses zero-shot accuracy on our curated retrieval task from
2021--2022 compared with more recently trained models in the OpenCLIP
repository. We then study how to efficiently train models on time-continuous
data. We demonstrate that a simple rehearsal-based approach that continues
training from the last checkpoint and replays old data reduces compute
compared to the standard practice of retraining from scratch.
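
A minimal sketch of the rehearsal idea described above: resume from the last
checkpoint and build each batch from a mix of new and replayed old samples.
The helper name, the replay fraction, and the checkpoint path are illustrative
assumptions; the benchmark's actual loaders and schedules are not shown here.

```python
import random

def rehearsal_batches(new_data, old_data, replay_frac=0.5, batch_size=256):
    """Yield batches mixing fresh samples with replayed old samples.

    Hypothetical helper. Assumes `old_data` holds at least
    `batch_size * replay_frac` samples to draw from.
    """
    n_old = int(batch_size * replay_frac)
    n_new = batch_size - n_old
    random.shuffle(new_data)
    for i in range(0, len(new_data) - n_new + 1, n_new):
        batch = new_data[i:i + n_new] + random.sample(old_data, n_old)
        random.shuffle(batch)  # interleave old and new within the batch
        yield batch

# Continual step (illustrative): resume from the last checkpoint instead of
# retraining from scratch, then train on the mixed batches.
# model.load_state_dict(torch.load("checkpoint_prev_year.pt"))
# for batch in rehearsal_batches(data_new_year, data_prev_years):
#     train_step(model, batch)
```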